PM 566: Homework 01

Author

Tarun Mahesh

PM-566 | Introduction to Health Data Science

Homework 01

Primary question to answer- Has the daily concentrations of PM2.5 decreased in California in the 20 years spanning 2002 - 2022?

Why answer this question?

PM2.5 is highly dangerous to health, and a quick glance at the epa.gov website and Google explains that the primary health risks associated with the inhalation of PM2.5 include respiratory issues like asthma exacerbation and bronchitis, cardiovascular problems and even premature death in highly sensitive individuals.

Answering whether PM2.5 has decreased in the state of California over the 2-decade span between 2002 and 2022 will help us analyse if the EPA and US Government’s efforts to improve air quality have been successful, and can provide a valuable starting point for any efforts we take moving forward to reduce air pollution.

STEP 1:

EDA Checklist items 1 through 5.

  • 1.1: Read in the data

  • 1.2: Check the size of the data

  • 1.3: Examine the variables and their types

  • 1.4: Look at the top and bottom of the data

  • 1.5: Visualize the distribution of the key variables

The aim is to perform steps 1.1-1.5 for the 2002 and 2022 data from the EPA website.

1.1: Read in the data + 1.2: Check data size [The files for 2002 and 2022 (California, All Sites) were downloaded from the EPA website, and the csv files were renamed to “2022_EPA” and “2002_EPA”, and saved into the .gitignore folder in my repository.]

# Loading Required Libraries

# Reading Data
epa22 <- read.csv("data/2022_EPA.csv")
epa02 <- read.csv("data/2002_EPA.csv")

# Checking Dimensions
dim(epa22)
[1] 59918    22
dim(epa02)
[1] 15976    22

EPA 2022 data has 59918 rows and 22 columns EPA 2002 data has 15976 rows and 22 columns

1.3: Variables and their types

# Variable Names and Types
print(paste("EPA 2022 Variables:", str(epa22)))
'data.frame':   59918 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily.Mean.PM2.5.Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local.Site.Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method.Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method.Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA.Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA.Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site.Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site.Longitude                : num  -122 -122 -122 -122 -122 ...
[1] "EPA 2022 Variables: "
print(paste("EPA 2002 Variables:", str(epa02)))
'data.frame':   15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily.Mean.PM2.5.Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local.Site.Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method.Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method.Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA.Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA.Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site.Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site.Longitude                : num  -122 -122 -122 -122 -122 ...
[1] "EPA 2002 Variables: "

Both data files (2022 and 2002) have the same 22 variables of the same data type

1.4: Top and bottom of the data

# Headers and Footers
head(epa22)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              58       Livermore               1              100
2              60       Livermore               1              100
3              39       Livermore               1              100
4              21       Livermore               1              100
5              23       Livermore               1              100
6              21       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         170
2              88101  PM2.5 - Local Conditions         170
3              88101  PM2.5 - Local Conditions         170
4              88101  PM2.5 - Local Conditions         170
5              88101  PM2.5 - Local Conditions         170
6              88101  PM2.5 - Local Conditions         170
                    Method.Description CBSA.Code
1 Met One BAM-1020 Mass Monitor w/VSCC     41860
2 Met One BAM-1020 Mass Monitor w/VSCC     41860
3 Met One BAM-1020 Mass Monitor w/VSCC     41860
4 Met One BAM-1020 Mass Monitor w/VSCC     41860
5 Met One BAM-1020 Mass Monitor w/VSCC     41860
6 Met One BAM-1020 Mass Monitor w/VSCC     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(epa22)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
59913 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
59914 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
59915 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
59916 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
59917 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
59918 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
59913              19 Woodland-Gibson Road               1              100
59914              21 Woodland-Gibson Road               1              100
59915              33 Woodland-Gibson Road               1              100
59916              99 Woodland-Gibson Road               1              100
59917              77 Woodland-Gibson Road               1              100
59918               6 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
59913              88101  PM2.5 - Local Conditions         145
59914              88101  PM2.5 - Local Conditions         145
59915              88101  PM2.5 - Local Conditions         145
59916              88101  PM2.5 - Local Conditions         145
59917              88101  PM2.5 - Local Conditions         145
59918              88101  PM2.5 - Local Conditions         145
                                         Method.Description CBSA.Code
59913 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59914 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59915 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59916 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59917 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59918 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                    CBSA.Name State.FIPS.Code      State
59913 Sacramento--Roseville--Arden-Arcade, CA               6 California
59914 Sacramento--Roseville--Arden-Arcade, CA               6 California
59915 Sacramento--Roseville--Arden-Arcade, CA               6 California
59916 Sacramento--Roseville--Arden-Arcade, CA               6 California
59917 Sacramento--Roseville--Arden-Arcade, CA               6 California
59918 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
59913              113   Yolo      38.66121      -121.7327
59914              113   Yolo      38.66121      -121.7327
59915              113   Yolo      38.66121      -121.7327
59916              113   Yolo      38.66121      -121.7327
59917              113   Yolo      38.66121      -121.7327
59918              113   Yolo      38.66121      -121.7327
head(epa02)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              81       Livermore               1              100
2              93       Livermore               1              100
3              74       Livermore               1              100
4              82       Livermore               1              100
5              98       Livermore               1              100
6             115       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         120
2              88101  PM2.5 - Local Conditions         120
3              88101  PM2.5 - Local Conditions         120
4              88101  PM2.5 - Local Conditions         120
5              88101  PM2.5 - Local Conditions         120
6              88101  PM2.5 - Local Conditions         120
                     Method.Description CBSA.Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(epa02)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
15971 12/10/2002    AQS 61131003   1                             15 ug/m3 LC
15972 12/13/2002    AQS 61131003   1                             15 ug/m3 LC
15973 12/22/2002    AQS 61131003   1                              1 ug/m3 LC
15974 12/25/2002    AQS 61131003   1                             23 ug/m3 LC
15975 12/28/2002    AQS 61131003   1                              5 ug/m3 LC
15976 12/31/2002    AQS 61131003   1                              6 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
15971              62 Woodland-Gibson Road               1              100
15972              62 Woodland-Gibson Road               1              100
15973               6 Woodland-Gibson Road               1              100
15974              77 Woodland-Gibson Road               1              100
15975              28 Woodland-Gibson Road               1              100
15976              33 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
15971              88101  PM2.5 - Local Conditions         117
15972              88101  PM2.5 - Local Conditions         117
15973              88101  PM2.5 - Local Conditions         117
15974              88101  PM2.5 - Local Conditions         117
15975              88101  PM2.5 - Local Conditions         117
15976              88101  PM2.5 - Local Conditions         117
                         Method.Description CBSA.Code
15971 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15972 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15973 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15974 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15975 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15976 R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                    CBSA.Name State.FIPS.Code      State
15971 Sacramento--Roseville--Arden-Arcade, CA               6 California
15972 Sacramento--Roseville--Arden-Arcade, CA               6 California
15973 Sacramento--Roseville--Arden-Arcade, CA               6 California
15974 Sacramento--Roseville--Arden-Arcade, CA               6 California
15975 Sacramento--Roseville--Arden-Arcade, CA               6 California
15976 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
15971              113   Yolo      38.66121      -121.7327
15972              113   Yolo      38.66121      -121.7327
15973              113   Yolo      38.66121      -121.7327
15974              113   Yolo      38.66121      -121.7327
15975              113   Yolo      38.66121      -121.7327
15976              113   Yolo      38.66121      -121.7327

EPA 2022 data ranges from 01/01/2022- 12/31/2022 & values are stored under the variable “Date”. There seems to be a sudden spike in PM2.5 values on the 19th and 25th of December.

EPA 2002 data ranges from 01/05/2002- 12/31/2002 & values are stored under the variable “Date”. The air quality data in 2002, upon initial inspection, seems to indicate higher PM2.5 at the start of the year and lowered levels towards the end, and a similar spike is seen on the 25th of December.

1.5 Visualizing the distribution of the key variable- PM2.5 To visualize the distribution of PM2.5, we can plot a quick histogram:

# 2022 Data
hist(epa22$Daily.Mean.PM2.5.Concentration, breaks = 40, 
     main = "PM2.5 in 2022 CA", xlab = "PM2.5 ug/m3")

hist(epa02$Daily.Mean.PM2.5.Concentration, breaks = 40, 
     main = "PM2.5 in 2002 CA", xlab = "PM2.5 ug/m3")

It looks like the PM2.5 in 2022 lies strongly towards the 0-50 range, with a majority (visually) between 0-10 ug/m3. In 2002, a lot of observations fall between 0-40 ug/m3, but some values lie beyond 40 ug/m3.

The EDA steps 1-5 are now complete.

STEP 2:

Combining 2002 & 2022 data, and changing the name of the PM2.5 variable.

Since the variables and their types match 1-1 between 2002 and 2022, it makes sense to combine them into the same data frame for convenience. And, since the PM2.5 variable is currently too long to reference each time (“Daily.Mean.PM2.5.Concentration”), it makes it easier to change the name to something more accessible. I am choosing “MeanPM”. I will be using Roger Peng’s EDA manual to write some of my code.

2.1 Combining both data frames

# We need dplyr for this action 
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Using the bind_rows function to merge both data frames 
epa_all <- bind_rows(epa02, epa22)

# Checking if the binding was successful using head and tail 
head(epa_all)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              81       Livermore               1              100
2              93       Livermore               1              100
3              74       Livermore               1              100
4              82       Livermore               1              100
5              98       Livermore               1              100
6             115       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         120
2              88101  PM2.5 - Local Conditions         120
3              88101  PM2.5 - Local Conditions         120
4              88101  PM2.5 - Local Conditions         120
5              88101  PM2.5 - Local Conditions         120
6              88101  PM2.5 - Local Conditions         120
                     Method.Description CBSA.Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(epa_all)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
75889 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
75890 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
75891 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
75892 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
75893 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
75894 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
75889              19 Woodland-Gibson Road               1              100
75890              21 Woodland-Gibson Road               1              100
75891              33 Woodland-Gibson Road               1              100
75892              99 Woodland-Gibson Road               1              100
75893              77 Woodland-Gibson Road               1              100
75894               6 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
75889              88101  PM2.5 - Local Conditions         145
75890              88101  PM2.5 - Local Conditions         145
75891              88101  PM2.5 - Local Conditions         145
75892              88101  PM2.5 - Local Conditions         145
75893              88101  PM2.5 - Local Conditions         145
75894              88101  PM2.5 - Local Conditions         145
                                         Method.Description CBSA.Code
75889 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
75890 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
75891 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
75892 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
75893 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
75894 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                    CBSA.Name State.FIPS.Code      State
75889 Sacramento--Roseville--Arden-Arcade, CA               6 California
75890 Sacramento--Roseville--Arden-Arcade, CA               6 California
75891 Sacramento--Roseville--Arden-Arcade, CA               6 California
75892 Sacramento--Roseville--Arden-Arcade, CA               6 California
75893 Sacramento--Roseville--Arden-Arcade, CA               6 California
75894 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
75889              113   Yolo      38.66121      -121.7327
75890              113   Yolo      38.66121      -121.7327
75891              113   Yolo      38.66121      -121.7327
75892              113   Yolo      38.66121      -121.7327
75893              113   Yolo      38.66121      -121.7327
75894              113   Yolo      38.66121      -121.7327

head(epa_all)- Data from 2002 visible tail(epa_all)- Data from 2022 visible Interpretation- The merge was successful.

2.2 Creating a column for “Year” from “Date”

# We need lubridate for this action
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
epa_all$Year <- year(mdy(epa_all$Date)) # The reason is used mdy here is 
# because Date is stored as a character value in the original data frame

# Checking to see if it worked
head(epa_all$Year)
[1] 2002 2002 2002 2002 2002 2002
tail(epa_all$Year)
[1] 2022 2022 2022 2022 2022 2022

2.3 Renaming the PM2.5 variable

# We need dplyr for this action 
library(dplyr)

# Renaming the column 
epa_all <- rename(epa_all, MeanPM = Daily.Mean.PM2.5.Concentration) 

# Checking to see if it worked 
head(epa_all$MeanPM)
[1] 25.1 31.6 21.4 25.9 34.5 41.0
tail(epa_all$MeanPM)
[1]  3.4  3.8  6.0 34.8 23.2  1.0

The MeanPM (previously Daily.Mean.PM2.5.Concentration) column now reflects the PM2.5 values.

STEP 3:

Creating a basic map(s) of the locations of the monitoring sites, using different colors for each year. For this step, I will be using leaflet.

# Loading in the libraries I will need 
library(dplyr)
library(leaflet)

sites <- epa_all %>% distinct(Year, Site.ID, Site.Latitude, Site.Longitude)

leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = filter(sites, Year == 2002), lng = ~Site.Longitude, lat = ~Site.Latitude, color = "blue", fillOpacity = 0, radius = 6) %>%
  addCircleMarkers(data = filter(sites, Year == 2022), lng = ~Site.Longitude, lat = ~Site.Latitude, color = "red", fillOpacity = 0.9, radius = 5)

Summary: The 2002 and 2022 data are well distributed throughout the state of California, but the 2022 data is much denser, and has more data points as compared to 2002. I would say the overall distribution is stable. Since I made the 2002 data points ring-shaped, I can easily visualize the overlapping sites on my map.

STEP 4:

Finding missing or implausible values of PM2.5, calculate the proportions of such data per year and then finding the temporal distribution of missing/implausible data.

For this, we can 1. Find the number of missing/NA values per year. 2. Define a plausible range for PM2.5, and then find the values that lie outside of it. 3. Find the total number of PM2.5 values (including the NA and implausible entries) for the years 2002 and 2022, and then use a simple division to find the proportion.

# NA values per year: Using dplyr to summarise the NA values per year

library(dplyr)

epa_all %>% group_by(Year) %>% summarise(missing_PM = sum(is.na(MeanPM)), 
                                         total = n(), 
                                         propor_missing = missing_PM / total)
# A tibble: 2 × 4
   Year missing_PM total propor_missing
  <dbl>      <int> <int>          <dbl>
1  2002          0 15976              0
2  2022          0 59918              0

There seems to be no NA values in either year, which is a good sign.

“Implausible” values of PM2.5 can vary based on the definition of implausibility. Any zero or negative PM2.5 value does not make sense logically, and thus, the lower range of implausibility can be set at <= 0. For the upper range of implausibility, EPA standards suggest that any PM2.5 values greater than 200 are highly unhealthy (hazardous above 300). My immediate thought, is that the State of California has historically poor air quality metrics, and also experiences regular wildfires which can add to the PM2.5 levels. So, in order to decide if an upper level is to be set for implausibility, I would like to look at the dates when PM2.5 > 200 were observed, and see if those dates correlate with any known high-pollution events during that year in California.

MeanPMImp22 <- epa_all[epa_all$MeanPM > 200 & epa_all$Year == 2022, ]
MeanPMImp02 <- epa_all[epa_all$MeanPM > 200 & epa_all$Year == 2002, ]

dim(MeanPMImp22)
[1]  7 23
dim(MeanPMImp02)
[1]  0 23

We see that there are no PM2.5 values > 200 in 2002, and 7 such values in 2022. Let’s try and see what dates were these 7 values were recorded.

library(dplyr)
library(ggplot2)

MeanPMImp22 <- MeanPMImp22 %>% select(MeanPM, Date, County)
View(MeanPMImp22)

epa_22 <- epa_all %>% filter(Year == 2022)

ggplot(epa_22, aes(x = as.Date(Date, format = "%m/%d/%Y"), y = MeanPM)) + 
  geom_line(color = "blue") + 
  geom_point(data = subset(epa_22, MeanPM > 200), 
  aes(x = as.Date(Date, format = "%m/%d/%Y"), y = MeanPM), 
  color = "red") + 
  labs(title = "Daily Mean PM2.5 in CA, 2022", x = "Date", y = "PM2.5 ug/m3")

A quick glance at the Calfire website, searching for fire incidents in these counties in 2022 shows me that there have been wildfires/structural fires close to the recorded dates of PM values > 200, so I will retain these values, and only remove <= 0 PM2.5 values from epa_all.

library(dplyr)
epa_all <- filter(epa_all, MeanPM > 0)
summary(epa_all$MeanPM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.10    4.50    7.60   10.08   12.20  302.50 

We now see that the new minimum for epa_all is 0.10 and that all rows where MeanPM <= 0 have been removed.

STEP 5:

Exploring the main question of interest (whether PM2.5 values have reduced between 2002 and 2022 in California) at three levels of spatial resolution

5.1: State

For the state-level visualization, it would be useful to see a boxplot distributed by year.

# Required libraries 
library(dplyr)
library(ggplot2)
library(tidyr)

ggplot(epa_all, aes(factor(epa_all$Year), epa_all$MeanPM)) + geom_boxplot() + 
  labs(x = "Year", y = "PM2.5 (ug/m3)", 
       title = "California Daily PM2.5 by year")
Warning: Use of `epa_all$Year` is discouraged.
ℹ Use `Year` instead.
Warning: Use of `epa_all$MeanPM` is discouraged.
ℹ Use `MeanPM` instead.

We could also visualise this as a violin plot

ggplot(epa_all, aes(factor(Year), MeanPM)) + 
  geom_violin(fill = "lightblue", color = "black") + 
  labs(x = "Year", y = "PM2.5 ug/m3", 
       title = "California Daily PM2.5 Distribution by Year")

Now for some summary statistics:

state_level <- epa_all %>% group_by(Year) %>% 
  summarise(mean_PM = mean(MeanPM, na.rm = TRUE), 
            median_PM = median(MeanPM, na.rm = TRUE), 
            min_PM = min(MeanPM, na.rm = TRUE), 
            max_PM = max(MeanPM, na.rm = TRUE)) 

print(state_level)
# A tibble: 2 × 5
   Year mean_PM median_PM min_PM max_PM
  <dbl>   <dbl>     <dbl>  <dbl>  <dbl>
1  2002   16.1       12      0.1   104.
2  2022    8.47       6.9    0.1   302.

Interpretation: Both mean and median PM2.5 values decreased in 2022 compared to 2002, even though the maximum PM2.5 recorded in 2022 was much higher than 2002. The violin plot has a long tail, which represents the extreme values of PM2.5 recorded in 2022.

5.2: County

For the county-level visualization, it might be useful to look at a slopegraph, which is commonly used in biological data analyses to look at changes in gene expression levels (for small sets of genes) over time. (slopegraphs aren’t usually used for large amounts of data (like tens of thousands of genes), because the plot begins to look messy). The question here, is whether to plot median or mean PM2.5 values. Median usually discounts extreme outliers and represents the way data looks independent of small data outliers, while using mean can skew the plotted observation to either extreme, but has the advantage of capturing the variability in the data. I would like to plot both, and compare the two.

library(stringr)

county_data <- epa_all %>% group_by(County, Year) %>% 
  summarise(med = median(MeanPM), .groups="drop")

ggplot(county_data, aes(x = factor(Year), y = med, group = County)) + 
  geom_line() + geom_point() + labs(x = NULL, y = "Median PM2.5 ug/m3", 
                                    title = "County medians: 2002 to 2022") + 
  theme(axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6), 
        strip.text = element_text(size = 7))

# To also make this plot a layout, with one plot for each county 

ggplot(county_data, aes(x = factor(Year), y = med, group = County)) + 
  geom_line() + geom_point() + facet_wrap(~ str_wrap(County, width = 30)) + 
  labs(x = NULL, y = "Median PM2.5 ug/m3", title = "County medians by year") + 
  theme(axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6), 
        strip.text = element_text(size = 7))
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

# The same plots, but using Mean

county_data2 <- epa_all %>% group_by(County, Year) %>% 
  summarise(men = mean(MeanPM), .groups = "drop")

ggplot(county_data2, aes(x = factor(Year), y = men, group = County)) + 
  geom_line() + geom_point() + labs(x = NULL, y = "Mean PM2.5 ug/m3", 
                                    title = "County average: 2002 to 2022") + 
  theme(axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6), 
        strip.text = element_text(size = 7))

ggplot(county_data2, aes(x = factor(Year), y = men, group = County)) + 
  geom_line() + geom_point() + facet_wrap(~ str_wrap(County, width = 30)) + 
  labs(x = NULL, y = "Mean PM2.5 ug/m3", title = "County means by year") + 
  theme(axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6), 
        strip.text = element_text(size = 7))
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Summarizing the county-level data:

library(stringr)

county_level <- epa_all %>% group_by(County, Year) %>% 
  summarise(mean_PM = mean(MeanPM, na.rm = TRUE), 
            median_PM = median(MeanPM, na.rm = TRUE), 
            .groups = "drop") 

most_change_CL <- county_level %>% 
  pivot_wider(names_from = Year, values_from = c(mean_PM, median_PM)) %>% 
  mutate(mean_change= mean_PM_2002 - mean_PM_2022, 
         median_change = median_PM_2002 - median_PM_2022) 

most_change_mean <- most_change_CL %>% arrange(desc(mean_change)) %>% 
  select(County, mean_change)
most_change_median <- most_change_CL %>% 
  arrange(desc(median_change)) %>% select(County, median_change)

print(county_level)
# A tibble: 98 × 4
   County        Year mean_PM median_PM
   <chr>        <dbl>   <dbl>     <dbl>
 1 Alameda       2002   14.3       10  
 2 Alameda       2022    8.22       7  
 3 Butte         2002   14.8       11.5
 4 Butte         2022    6.29       4.5
 5 Calaveras     2002    9.9        8  
 6 Calaveras     2022    6.05       5  
 7 Colusa        2002   11.7        9  
 8 Colusa        2022    7.61       6.7
 9 Contra Costa  2002   15.1        9.5
10 Contra Costa  2022    8.24       7.2
# ℹ 88 more rows
print(most_change_mean)
# A tibble: 51 × 2
   County      mean_change
   <chr>             <dbl>
 1 Riverside         13.2 
 2 Kern              13.0 
 3 Tulare            12.6 
 4 Merced            11.5 
 5 Kings             10.3 
 6 Fresno             9.72
 7 San Joaquin        8.73
 8 Los Angeles        8.67
 9 Orange             8.57
10 Butte              8.47
# ℹ 41 more rows
print(most_change_median)
# A tibble: 51 × 2
   County      median_change
   <chr>               <dbl>
 1 Tulare               11.3
 2 Riverside             9.7
 3 Kern                  9.5
 4 Merced                9.1
 5 Kings                 7.4
 6 Los Angeles           7.1
 7 Butte                 7  
 8 San Joaquin           6.6
 9 Ventura               6.6
10 Orange                6.5
# ℹ 41 more rows

Interpretation: Visually, the slopegraphs show that in most counties in California, the median and mean PM2.5 reduced in 2022 compared to 2002. Using pivot_wider, we can find the county with the most change in median and mean PM2.5. Since in pivot_wider I subtract the mean/median PM of 2022 from 2002, the most positive values represent the most decrease in PM2.5 (mean/median) in 2022.

Riverside saw the best improvement in mean PM2.5, while Tulare saw the best improvement in median PM2.5. This is really interesting, as it seems to highlight the difference between using mean vs. median to analyze data.

5.3: City (Los Angeles County Sites)

For the city-level visualization, I see two different identifiers for Sites- “Local.Site.Name” and “Site.ID”. I want to see how the data is recorded for both variables and decide which one to use.

table(epa_all$Local.Site.Name)

                                            
                                        118 
                                   29 Palms 
                                        767 
                    3425 N FIRST ST, FRESNO 
                                        499 
                                 Agua Tibia 
                                        168 
                                     Alpine 
                                        467 
                         Alturas-Fourth St. 
                                          2 
                                    Anaheim 
                                       1120 
                                 Atascadero 
                                        345 
                 Atascadero (original site) 
                                        116 
                              Auburn-Atwood 
                                        344 
                                      Azusa 
                                        415 
                Bakersfield-Airport (Planz) 
                                        231 
                     Bakersfield-California 
                                       2007 
                  Bakersfield-Golden / M St 
                                        420 
                            Banning Airport 
                                        365 
                                   Big Bear 
                                        450 
                           Bishop Tribe EMO 
                                        704 
                                   Bliss SP 
                                        113 
                            Bliss SP (TRPA) 
                                        117 
                    Brawley-220 Main Street 
                                        348 
                       Brawley-401 Main St. 
                                         85 
                                    Burbank 
                                        122 
                      Calexico-Ethel Street 
                                        613 
                             Camp Pendleton 
                                        479 
                   Campo Indian Reservation 
                                        724 
                              Carmel Valley 
                                        343 
                                        CDF 
                                        359 
                      Chester-222 First Ave 
                                        299 
                          Chico-East Avenue 
                                        409 
                       Chico-Manzanita Ave. 
                                        120 
                                Chula Vista 
                                        218 
                               Clovis-Villa 
                                        443 
                           Colfax-City Hall 
                                        358 
                        Colusa-Sunrise Blvd 
                                        454 
                                    Compton 
                                        723 
                                    Concord 
                                        739 
                         Corcoran-Patterson 
                                        443 
                   Cortina Indian Rancheria 
                                         42 
          Crescent City-Crescent Elk School 
                                        332 
                                  Crestline 
                                        178 
                           Davis-UCD Campus 
                                        312 
             Death Valley NP - Park Village 
                                        115 
                                    Donovan 
                                        507 
                                Echo Summit 
                                          2 
                                   El Cajon 
                                        416 
     El Cajon - Lexington Elementary School 
                                        631 
                       El Centro-9th Street 
                                        450 
                  El Rio-Rio Mesa School #2 
                                        469 
                                  Escondido 
                                        355 
                            Eureka I Street 
                                         59 
                          Folsom-Natoma St. 
                                        689 
                                    Fontana 
                                        298 
                       Fremont - Chapel Way 
                                        105 
                           Fresno - Garland 
                                        533 
                             Fresno-Foundry 
                                        357 
                             Fresno-Pacific 
                                        443 
                                     Gilroy 
                                        340 
                                   Glendora 
                                        361 
                                     Goleta 
                                        359 
               Grass Valley-Litton Building 
                                        404 
                              Hanford-Irwin 
                                        361 
                                  Hollister 
                                        356 
                                     Hoover 
                                        228 
                                      Huron 
                                        357 
                                      Indio 
                                        240 
                                     Jacobs 
                                        116 
                             Joshua Tree NP 
                                        215 
                                     Kaiser 
                                        195 
                                Kearny Mesa 
                                        111 
                                     Keeler 
                                        635 
                                King City 2 
                                        344 
                              Lake Elsinore 
                                        360 
               Lake Tahoe Community College 
                                        111 
                    Lakeport-Lakeport Blvd. 
                                         61 
                    Lakeport-S. Main Street 
                                         61 
                  Lancaster-Division Street 
                                        455 
                              Laney College 
                                        360 
                         Lassen Volcanic NP 
                                        209 
                Lava Beds National Monument 
                                        216 
                                      Lebec 
                                        490 
                    Lebec-Peace Valley Road 
                                         93 
      Lebec-Peace Valley/Frazier Park Roads 
                                        106 
                                 Lee Vining 
                                        716 
                    Lincoln-2885 Moore Road 
                                        357 
                                  Livermore 
                                        439 
                            Lompoc H Street 
                                        351 
      Lone Pine Paiute-Shoshone Reservation 
                                        604 
                         Long Beach (North) 
                                        411 
                         Long Beach (South) 
                                        239 
             Long Beach-Route 710 Near Road 
                                        574 
              Los Angeles-North Main Street 
                                       1271 
                                    Lynwood 
                                        122 
                                Madera-City 
                                        360 
                              Mammoth Lakes 
                                        110 
                                    Manteca 
                                        308 
                              Merced-Coffee 
                                        359 
                                Merced-M St 
                                        449 
                                      Mesa2 
                                        357 
                      Mira Loma (Van Buren) 
                                        777 
                              Mission Viejo 
                                        227 
                        Modesto-14th Street 
                                        587 
                                     Mojave 
                                        100 
                    Mojave - CA 58 Business 
                                        355 
             Morongo Air Monitoring Station 
                                        370 
                     North Hollywood (NOHO) 
                                        365 
                                    Oakland 
                                        365 
                               Oakland West 
                                        359 
                       Ojai - East Ojai Ave 
                                        360 
                       Ontario Fire Station 
                                        111 
                 Ontario-Route 60 Near Road 
                                        530 
                                Pala Airpad 
                                        694 
                               Palm Springs 
                                        239 
                         Paradise - Theater 
                                        347 
                                   Pasadena 
                                        241 
                                   Pechanga 
                                        401 
                             Pico Rivera #2 
                                        118 
                               Pinnacles NP 
                                        238 
                             Piru - Pacific 
                                        475 
                      Pleasanton - Owens Ct 
                                        361 
              Point Reyes National Seashore 
                                        213 
                                Porterville 
                                        356 
                                    Portola 
                                        485 
                  Portola-161 Nevada Street 
                                        100 
                     Quincy-N Church Street 
                                        410 
       Red Bluff-Walnut St. District Office 
                                        347 
                          Redding - Buckeye 
                                         54 
                Redding - Health Department 
                                        427 
                            Redding - Toyon 
                                         59 
                               Redwood City 
                                        449 
                                 Redwood NP 
                                        229 
                                     Reseda 
                                        602 
                  Ridgecrest-California Ave 
                                        104 
                            Ridgecrest-Ward 
                                        355 
                       Riverside (Magnolia) 
                                        115 
                    Roseville-N Sunrise Ave 
                                        409 
                                   Rubidoux 
                                       1372 
Sacramento Health Department-Stockton Blvd. 
                                        154 
                   Sacramento-1309 T Street 
                                        826 
                    Sacramento-Bercut Drive 
                                        359 
                  Sacramento-Del Paso Manor 
                                        988 
                                  Salinas 3 
                                        530 
               San Andreas-Gold Strike Road 
                                        414 
                             San Bernardino 
                                        235 
               San Diego - Kearny Villa Rd. 
                                        172 
      San Diego - Sherman Elementary School 
                                        664 
             San Diego -Rancho Carmel Drive 
                                        119 
                         San Diego-12th Ave 
                                        352 
                              San Francisco 
                                        546 
                    San Gorgonio Wilderness 
                                        229 
                                   San Jose 
                                        217 
                         San Jose - 4th St. 
                                        141 
                         San Jose - Jackson 
                                        585 
                     San Jose - Knox Avenue 
                                        358 
           San Lorenzo Valley Middle School 
                                        345 
                  San Luis Obispo-Marsh St. 
                                         52 
                                  San Pablo 
                                        352 
                                 San Rafael 
                                        547 
                              Santa Barbara 
                                        349 
                              Santa Clarita 
                                        365 
                                 Santa Cruz 
                                        406 
                                Santa Maria 
                                        146 
                        Santa Rosa - 5th St 
                                         93 
                                 Sebastopol 
                                        345 
  Sequoia & Kings Canyon NPs - Ash Mountain 
                                        340 
                                 Sequoia NP 
                                        226 
                         Signal Hill (LBSH) 
                                        289 
                 Simi Valley-Cochran Street 
                                        873 
                                SLO Roberto 
                                        358 
                                Sloughhouse 
                                        342 
                 South Lake Tahoe-Sandy Way 
                                         93 
                 Stn.1 Big Pine Paiute site 
                                        650 
                 Stockton - University Park 
                                        348 
                          Stockton-Hazelton 
                                        124 
         Table Mountain Air Monitoring Site 
                                        339 
                   Tahoe City-Fairway Drive 
                                        355 
                                   Temecula 
                                        365 
                              Thousand Oaks 
                                        512 
                Torres Martinez Reservation 
                                        470 
                              Tracy-Airport 
                                        345 
                  TRAFFIC, RURAL PAVED ROAD 
                                        700 
                               Tranquillity 
                                        349 
                                    Trinity 
                                        139 
                       Truckee-Fire Station 
                                        545 
                                    Turlock 
                                        340 
                              Ukiah-Library 
                                        469 
                                     Upland 
                                        365 
                                    Vallejo 
                                        816 
                    Victorville-Park Avenue 
                                        938 
                             Visalia-Church 
                                        404 
                  Visalia-W. Ashland Avenue 
                                        393 
                     Weaverville-Courthouse 
                                        350 
                       Willits-Blosser Lane 
                                        351 
                      Willows-Colusa Street 
                                        328 
                                 WMRC/NCORE 
                                        755 
                       Woodland-Gibson Road 
                                        171 
                                Yosemite NP 
                                        340 
 Yosemite NP-Yosemite Village Vistor Center 
                                        522 
                                      Yreka 
                                        312 
                                  Yuba City 
                                        826 
table(epa_all$Site.ID)

60010007 60010009 60010011 60010012 60010015 60011001 60070002 60070008 
     439      365      359      360      361      105      120      409 
60072002 60074001 60090001 60110007 60111002 60130002 60131004 60150002 
     347      700      414       42      454      739      352      229 
60150007 60170011 60170012 60179000 60179001 60179002 60190008 60190011 
     332       93        2      113      111      117      499      533 
60190500 60192008 60192009 60192016 60195001 60195025 60199000 60210003 
     339      357      349      357      443      443      195      328 
60231002 60231004 60250003 60250005 60250007 60250010 60251003 60270002 
      59      116       85      613      348      470      450      755 
60270101 60271003 60271018 60271023 60271033 60290010 60290011 60290014 
     115      635      604      704      650      420      100     2007 
60290015 60290016 60290018 60290019 60292009 60299001 60299002 60310004 
     104      231      355      355      340      106       93      443 
60311004 60333001 60333002 60370002 60370016 60371002 60371103 60371201 
     361       61       61      415      361      122     1271      602 
60371301 60371302 60371601 60371602 60372005 60374002 60374004 60374008 
     122      723      118      118      241      411      239      574 
60374009 60374010 60376012 60379033 60379034 60392010 60410001 60410002 
     289      365      365      455      150      360      360      213 
60430003 60431001 60450006 60452003 60470003 60472510 60490001 60510001 
     340      522      469      351      359      449        2      110 
60510005 60519000 60530002 60530008 60531003 60570005 60571001 60590007 
     716      228      343      344      530      404      545     1120 
60592022 60610003 60610004 60610006 60611004 60612003 60631006 60631007 
     227      344      358      409      355      357      410      299 
60631009 60631010 60650009 60650012 60650016 60650500 60651003 60651016 
     100      485      401      365      365      767      115      370 
60652002 60655001 60658001 60658005 60659000 60659001 60670006 60670010 
     240      239     1372      777      168      360      988      826 
60670012 60670015 60674001 60675003 60690002 60690003 60710005 60710025 
     689      359      154      342      356      238      178      111 
60710027 60710306 60711004 60712002 60718001 60719002 60719004 60719010 
     530      938      365      298      450      215      235      229 
60730001 60730003 60730006 60730077 60731002 60731006 60731007 60731008 
     218      416      111      724      355      467      352      479 
60731014 60731016 60731017 60731022 60731026 60731201 60750005 60771002 
     507      172      119      631      664      694      546      124 
60771003 60772010 60773005 60792002 60792004 60792007 60792020 60798001 
     348      308      345       52      357      359      358      116 
60798002 60811001 60830011 60831008 60831009 60832004 60832011 60839000 
     345      449      349       60       86      351      359      187 
60850002 60850004 60850005 60850006 60852003 60870007 60871005 60890004 
     340      141      585      358      217      406      345      427 
60893003 60893004 60893005 60930005 60932001 60950004 60970003 60970004 
     209       54       59      216      312      816       93      345 
60990005 60990006 61010003 61030007 61050002 61059000 61070009 61071001 
     587      340      826      347      350      139      340      226 
61072002 61072003 61072010 61110007 61110009 61111004 61112002 61113001 
     404      393      356      512      475      360      873      469 
61130004 61131003 
     312      171 

It seems like the local site names are very variable. Some names are capitalised while others are not, which might make it difficult for data processing. Instead, using the Site ID can ensure consistency.

library(stringr)

la_sites <- epa_all %>% filter(County == "Los Angeles") %>% 
  group_by(Site.ID, Year, Local.Site.Name) %>% 
  summarise(med = median(MeanPM, .groups="drop"))
`summarise()` has grouped output by 'Site.ID', 'Year'. You can override using
the `.groups` argument.
ggplot(la_sites, aes(x = factor(Year), y = med, group = Site.ID)) + geom_line() + 
  geom_point() + facet_wrap(~ str_wrap(Local.Site.Name, width = 15), 
                            scales = "free_y") + 
  labs(x = NULL, y = "Median PM2.5 ug/m3", 
       title = "LA County site medians 2002 vs 2022")
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

# Repeating the same plots but with mean PM2.5 values. 
la_sites_mean <- epa_all %>% filter(County == "Los Angeles") %>% group_by(Site.ID, Year, Local.Site.Name) %>% 
  summarise(men = mean(MeanPM, .groups="drop"))
`summarise()` has grouped output by 'Site.ID', 'Year'. You can override using
the `.groups` argument.
ggplot(la_sites_mean, aes(x = factor(Year), y = men, group = Site.ID)) + 
  geom_line() + geom_point() + 
  facet_wrap(~ str_wrap(Local.Site.Name, width = 15), scales = "free_y") + 
  labs(x = NULL, y = "Mean PM2.5 ug/m3", 
       title = "LA County site means 2002 vs 2022")
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Summarizing the site level data:

site_level <- epa_all %>% filter(County == "Los Angeles") %>% 
  group_by(Local.Site.Name, Year) %>% 
  summarise(mean_PM = mean(MeanPM, na.rm = TRUE), 
            median_PM = median(MeanPM, na.rm = TRUE), 
            .groups = "drop")

most_change_SL <- site_level %>% 
  pivot_wider(names_from = Year, values_from = c(mean_PM, median_PM)) %>% 
  mutate(mean_change_SL = mean_PM_2002 - mean_PM_2022, 
         median_change_SL = median_PM_2002 - median_PM_2022)

most_change_mean_SL <- most_change_SL %>% arrange(desc(mean_change_SL)) %>% 
  select(Local.Site.Name, mean_change_SL)
most_change_median_SL <- most_change_SL %>% arrange(desc(median_change_SL)) %>% 
  select(Local.Site.Name, median_change_SL)

print(site_level)
# A tibble: 25 × 4
   Local.Site.Name              Year mean_PM median_PM
   <chr>                       <dbl>   <dbl>     <dbl>
 1 ""                           2002   23.9      21.4 
 2 "Azusa"                      2002   20.8      18.7 
 3 "Azusa"                      2022    9.72      9.65
 4 "Burbank"                    2002   24.0      21.6 
 5 "Compton"                    2022   13.0      11.9 
 6 "Glendora"                   2022    8.52      7.9 
 7 "Lancaster-Division Street"  2002   10.4      10   
 8 "Lancaster-Division Street"  2022    7.52      7.3 
 9 "Lebec"                      2002    4.82      4.8 
10 "Lebec"                      2022    3.50      3.4 
# ℹ 15 more rows
print(most_change_mean_SL)
# A tibble: 18 × 2
   Local.Site.Name                  mean_change_SL
   <chr>                                     <dbl>
 1 "Pasadena"                                11.2 
 2 "Azusa"                                   11.0 
 3 "Los Angeles-North Main Street"           10.4 
 4 "Long Beach (North)"                       9.55
 5 "Reseda"                                   8.14
 6 "Lancaster-Division Street"                2.86
 7 "Lebec"                                    1.32
 8 ""                                        NA   
 9 "Burbank"                                 NA   
10 "Compton"                                 NA   
11 "Glendora"                                NA   
12 "Long Beach (South)"                      NA   
13 "Long Beach-Route 710 Near Road"          NA   
14 "Lynwood"                                 NA   
15 "North Hollywood (NOHO)"                  NA   
16 "Pico Rivera #2"                          NA   
17 "Santa Clarita"                           NA   
18 "Signal Hill (LBSH)"                      NA   
print(most_change_median_SL)
# A tibble: 18 × 2
   Local.Site.Name                  median_change_SL
   <chr>                                       <dbl>
 1 "Pasadena"                                   9.9 
 2 "Azusa"                                      9.05
 3 "Los Angeles-North Main Street"              8.4 
 4 "Long Beach (North)"                         7.25
 5 "Reseda"                                     6.75
 6 "Lancaster-Division Street"                  2.7 
 7 "Lebec"                                      1.4 
 8 ""                                          NA   
 9 "Burbank"                                   NA   
10 "Compton"                                   NA   
11 "Glendora"                                  NA   
12 "Long Beach (South)"                        NA   
13 "Long Beach-Route 710 Near Road"            NA   
14 "Lynwood"                                   NA   
15 "North Hollywood (NOHO)"                    NA   
16 "Pico Rivera #2"                            NA   
17 "Santa Clarita"                             NA   
18 "Signal Hill (LBSH)"                        NA   

Interpretation: It seems like although there were no NA values for PM2.5, there are some PM2.5 values recorded without a Local.Site.Name attribute, and some Local.Site.Name attributes for whom PM2.5 was not recorded in one or the other year. Regardless, comparing the line plots for 2002 to 2022 for all sites in LA county, median and mean PM2.5 decreased in 2022. The summary statistics using pivot_wider also show us that Pasadena saw the most improvement in both mean and median PM2.5 in LA county.

CONCLUSION

From the exploratory data analysis, across CA, daily PM2.5 concentrations decreased from 2002 to 2022. Most counties and LA County sites show declines, with the largest improvements being in Riverside/Tulare. While 2022 had a few extremely polluted days (PM2.5 > 200 ug/m3), these are one-off incidents and not persistent. Overall, the evidence supports that PM2.5 levels decreased between 2002 and 2022 in California.